R Advanced Course

Functions and parallel computation

Objective and contents

Objective and contents

Objective: to learn how to generate your own functions and perform parallel computations.

  • Writing and using functions in R
    • Extracting mean annual precipitation by country
  • Parallel computation in R
    • Parallelisation of processes

Writing and using functions in R

Writing Functions in R

R allows the user to create functions. The mode of these expressions is function, are stored in a special form, and might be used in further expressions.

A function is defined as follows:

function_name <- function(arg1, arg2, ...) expression

function_name is the name of the function defined by the user. We do not recommend to use existing function names.

argn are the different arguments of the function.

expression is the process that the function will follow when is applied.

Writing Functions in R

The function is then applied as follows:

function_name(arg1, arg2, ...)

Writing Functions in R

As an example, let’s generate a function that counts the characteres in a string. First we will start with generating the skeleton og the function and adding a parameter named word:

count_char <- function(word){
  
}

Then, we have to apply the nchar function.

count_char <- function(word){
  
  n <- nchar(word)
  
}

Writing Functions in R

Finally, we can include a message with the result. Note that we will use the return function to explicitly indicate the output of our function.

count_char <- function(word){
  
  n <- nchar(word)
  n <- paste("The word", word, "has", n, "characters!")
  return(n)
}

Writing Functions in R

Now let’s apply the function:

count_char(word = "Oscar Baez")
## [1] "The word Oscar Baez has 10 characters!"
count_char(word = "Spatial analysis")
## [1] "The word Spatial analysis has 16 characters!"
count_char(word = 10:14)
## [1] "The word 10 has 2 characters!" "The word 11 has 2 characters!"
## [3] "The word 12 has 2 characters!" "The word 13 has 2 characters!"
## [5] "The word 14 has 2 characters!"

Writing Functions in R

In the last example, we provided a numeric vector to the word parameter. We could control the input of the parameter. For example:

count_char <- function(word){
  
  if(!is.character(word))
    stop("The input is not a character!")
  
  n <- nchar(word)
  n <- paste("The word", word, "has", n, "characters!")
  return(n)
}

The stop function will exit the execution of the function if the condition is not met.

Writing Functions in R

count_char("Hello guys!")
## [1] "The word Hello guys! has 11 characters!"
count_char(1528)
## Error in count_char(1528): The input is not a character!
count_char("This is an awesome course!")
## [1] "The word This is an awesome course! has 26 characters!"

Writing Functions in R

How would you create a function that gives you the maximum value in a given vector?

Writing Functions in R

How would you create a function that gives you the maximum value in a given vector?

get_maximum <- function(v){
  
  mx <- max(v)
  return(mx)
  
}

If we apply it:

get_maximum(1:10)
## [1] 10
get_maximum(rnorm(100))
## [1] 2.195209

Aplying functions to gridded data

Now that we have created our own function, how can we apply it to a raster stack to get the maximum value per grid-cell over a defined period? For this purpose, we can use the app function from the terra package.

library(terra)

For this purpose, we will use monthly CHIRPSv2 data for 2000.

chirps_path  <- "C:/User/Data/CHIRPS_monthly"
chirps_files <- list.files(chirps_path, full.names = TRUE)
chirps       <- rast(chirps_files)

Aplying functions to gridded data

print(chirps)
## class       : SpatRaster 
## dimensions  : 2000, 7200, 12  (nrow, ncol, nlyr)
## resolution  : 0.05, 0.05  (x, y)
## extent      : -180, 180, -50, 50  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## sources     : chirps_monthly2000-01.tif  
##               chirps_monthly2000-02.tif  
##               chirps_monthly2000-03.tif  
##               ... and 9 more source(s)
## names       : chirp~00-01, chirp~00-02, chirp~00-03, chirp~00-04, chirp~00-05, chirp~00-06, ... 
## min values  :           0,           0,           0,           0,           0,           0, ... 
## max values  :    1251.456,    1467.558,    1222.492,    1224.301,    1322.237,    1584.747, ...

Aplying functions to gridded data

Now, we can apply the app function to obtain the maximum value over each grid-cell.

max_values <- app(chirps, get_maximum)
plot(max_values)

Extracting mean annual precipitation by country

Extracting mean annual precipitation by country

We will compute the mean annual precipitation using 12 monthly files of CHIRPSv2 over the Nile Basin countries in this example:

shape_path <- "c:/User/Countries/Nile_Basin_Countries_GAUL2014_2.shp" 
countries  <-vect(shape_path)

Extracting mean annual precipitation by country

print(countries)
##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 15, 9  (geometries, attributes)
##  extent      : 12.19545, 47.98618, -13.45568, 31.66734  (xmin, xmax, ymin, ymax)
##  source      : Nile_Basin_Countries_GAUL2014_2.shp
##  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
##  names       :       STATUS DISP_AREA ADM0_CODE   ADM0_NAME STR0_YEAR EXP0_YEAR
##  type        :        <chr>     <chr>     <int>       <chr>     <int>     <int>
##  values      : Member State        NO       205      Rwanda      1000      3000
##                Member State        NO       133       Kenya      1000      3000
##                Member State        NO        74 South Sudan      2011      3000
##  Shape_Leng Shape_Area  Name_label
##       <num>      <num>       <chr>
##       8.127      2.064      Rwanda
##       48.55      47.35       Kenya
##       46.91       51.6 South Sudan

Extracting mean annual precipitation by country

Therefore, we want to create a function that:

    1. Takes twelve monthly precipitation global rasters and generates a stack.
    1. Takes a shapefile with only one feature.
    1. Returns the mean annual precipitation over the selected feature.

Once that the function is ready, it can be applied to the different features of the shapefile.

Extracting mean annual precipitation by country

First, we can prepare the data that will be used as input in the function. We will save the raster paths in an object called raster_paths and the shapefile in an object named shape.

raster_paths <- "C:/Users/Data/CHIRPS_monthly/"
raster_paths <- list.files(raster_paths, full.names = TRUE)
shape        <- countries[1,]

Extracting mean annual precipitation by country

print(shape)
##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 1, 9  (geometries, attributes)
##  extent      : 28.86187, 30.89965, -2.839881, -1.047913  (xmin, xmax, ymin, ymax)
##  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
##  names       :       STATUS DISP_AREA ADM0_CODE ADM0_NAME STR0_YEAR EXP0_YEAR
##  type        :        <chr>     <chr>     <int>     <chr>     <int>     <int>
##  values      : Member State        NO       205    Rwanda      1000      3000
##  Shape_Leng Shape_Area Name_label
##       <num>      <num>      <chr>
##       8.127      2.064     Rwanda

Extracting mean annual precipitation by country

Then, we can create the function. We will call it get_map for mean annual precipitation.

get_map <- function(){
  
}

Extracting mean annual precipitation by country

This function, requires two inputs, the 12 paths of the global precipitation rasters (one for each month o the year), and the shapefile. We can give the paths of the objects and read them within the function.

get_map <- function(raster_paths, shape){
  
}

Extracting mean annual precipitation by country

At the end, we will assign these objects to the function i.e., raster_paths = raster_paths and shape = shape. The first step would be to import the raster data using the rast function from the terra package.

get_map <- function(raster_paths, shape){
  
  # reading the gridded data
  p_rast <- terra::rast(raster_paths)
  
}

Extracting mean annual precipitation by country

The second step would be to crop and mask the raster stack to the extent of the shapefile.

get_map <- function(raster_paths, shape){
  
  # reading the gridded data
  p_rast <- terra::rast(raster_paths)
  
  # cropping and masking the raster stack
  p_rast <- terra::crop(p_rast, shape, snap = "out")
  p_rast <- terra::mask(p_rast, shape)
  
}

Extracting mean annual precipitation by country

Finally, we can sum the 12 precipitation layers and return the result.

get_map <- function(raster_paths, shape){
  
  # reading the gridded data
  p_rast <- terra::rast(raster_paths)
  
  # cropping and masking the raster stack
  p_rast <- terra::crop(p_rast, shape, snap = "out")
  p_rast <- terra::mask(p_rast, shape)
  
  # adding the P layers
  p_rast <- sum(p_rast)
  
  return(p_rast)
  
}

Extracting mean annual precipitation by country

Once we have the function ready, we can apply it:

rwanda_p <- get_map(raster_paths = raster_paths, shape = shape)
plot(rwanda_p)

Parallel computation in R

Parallel computation in R

Processing large amounts of spatial data can be very time consuming; however, most of the R code runs on a single processor. Most of the time this is not a problem but sometime, these computations can be:

  • Time consuming

  • Memory consuming

  • Great time investment in reading or writing files

  • Great time in transferring data

Traditional computers have a single CPU, that in turn can contain multiple cores. These processors and cores are able to perform computations (note that modern computers can have multiple processors).

Parallel computation in R

The following function generates a list of prime numbers up to a designed number. This function was taken from stackoverflow.

prime_numbers <- function(n){
   n <- as.integer(n)
   if(n > 1e6) stop("n too large")
   primes     <- rep(TRUE, n)
   primes[1]  <- FALSE
   last.prime <- 2L
   for(i in last.prime:floor(sqrt(n)))
   {
      primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
      last.prime <- last.prime + min(which(primes[(last.prime+1):n]))
   }
   which(primes)
}

prime_numbers(100)
##  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97

Parallel computation in R

We will use the function Sys.time to assess the time needed by this function to retrieve prime numbers for a sequence starting in 10 and ending in 15,000.

sequence <- 10:15000

# creating an empty vetor
res <- c()

system.time({
  for(i in sequence){
    res[[i]] <- prime_numbers(i)
  }
})
##    user  system elapsed 
##  63.784   0.112  63.911

Parallel computation in R

To parallelise the code we will use the doParallel package. This package provides functions for parallel execution of R code on machines with multiple cores or processors or multiple computers.

library(doParallel)

We have to indicate the number of cores to be used. For this purpose, we can use the detectCores function.

detectCores()
## [1] 8

Parallel computation in R

It is not recommended to use all cores so your computer don’t freeze; therefore:

cores <- detectCores() - 1

Afterwards, we have to register the parallel backend. For this purpose, we will use the registerDoParallel function.

registerDoParallel(cores = cores)

Parallel computation in R

Finally, the for iterative process will be different. Instead of for, we will use foreach; instead of in we will use =. Afterwards, we can explicitly indicate which packages will be used with .packages(we will go through this in the example). Finally, the operator %dopar% should be placed before opening the foreach.

The function stopImplicitCluster can be used in vignettes and other places where it is important to explicitly close the implicitly created cluster.

Parallel computation in R

sequence <- 10:15000

# Parallel 
cores <- detectCores() - 1
registerDoParallel(cores = cores)

# creating an empty vector
res <- c()

system.time({
  foreach(i = sequence) %dopar% {
      res[[i]] <- prime_numbers(i)
  }
})
##    user  system elapsed 
## 144.602   3.066  25.411
stopImplicitCluster()

Parallelisation of processes

Parallelisation of processes

Coming back to the example of extracting mean annual precipitation by country, let’s obtain the mean annual precipitation for all Nile Basin countries.

plot(countries, axes = TRUE)

Parallelisation of processes

Checking at the data, we can see that there are disputed areas in the shapefile. As we want only the countries we can exclude them.

countries <- countries[which(countries$DISP_AREA == "NO"),]
data.frame(countries)
##          STATUS DISP_AREA ADM0_CODE                        ADM0_NAME STR0_YEAR
## 1  Member State        NO       205                           Rwanda      1000
## 2  Member State        NO       133                            Kenya      1000
## 3  Member State        NO        74                      South Sudan      2011
## 4  Member State        NO       257      United Republic of Tanzania      1000
## 5  Member State        NO       253                           Uganda      1000
## 6  Member State        NO        79                         Ethiopia      1000
## 7  Member State        NO        77                          Eritrea      1000
## 8  Member State        NO        43                          Burundi      1000
## 9  Member State        NO        68 Democratic Republic of the Congo      1000
## 10 Member State        NO         6                            Sudan      2011
## 11 Member State        NO     40765                            Egypt      1000
##    EXP0_YEAR Shape_Leng Shape_Area                       Name_label
## 1       3000   8.127003   2.063852                           Rwanda
## 2       3000  48.549430  47.345770                            Kenya
## 3       3000  46.905431  51.599166                      South Sudan
## 4       3000  82.950601  76.860266      United Republic of Tanzania
## 5       3000  23.244416  19.629674                           Uganda
## 6       3000  50.380131  92.869258                         Ethiopia
## 7       3000  50.005783  10.167958                          Eritrea
## 8       3000   9.118459   2.185652                          Burundi
## 9       3000  98.951177 189.998107 Democratic Republic of the Congo
## 10      3000  81.910242 155.888802                            Sudan
## 11      3000  61.251157  89.079113                            Egypt

Parallelisation of processes

Now, we can apply the function get_map for every country. As this process surely will be time consuming we can paralelise it. First, here is the get_map function.

get_map <- function(raster_paths, shape){
  
  # reading the gridded data
  p_rast <- terra::rast(raster_paths)
  
  # cropping and masking the raster stack
  p_rast <- terra::crop(p_rast, shape, snap = "out")
  p_rast <- terra::mask(p_rast, shape)
  
  # adding the P layers
  p_rast <- sum(p_rast)
  
  return(p_rast)
  
}

Parallelisation of processes

First we store the raster file paths into an object as we did before. Then we can set the output path where we want to store the resuting files.

Parallelisation of processes

Then we construct the foreach loop as follows:

  • Set the number of cores with detectCores

  • Register the cluster with registerDoParallel

  • Start the foreach iterative process

  • Indicate the packages that will be used. In this case terra

  • Include the parallel operator %dopar%

  • Develop the process and export the raster data

  • Close the loop

  • Close the parallel cluster if needed with stopImplicitCluster

Parallelisation of processes

cores <- detectCores() - 1
registerDoParallel(cores = cores)

foreach(i = 1:nrow(countries), .packages = "terra") %dopar% {
  
  r <- get_map(raster_paths = raster_paths, shape = countries[i,])
  
  n <- countries[i,]$Name_label
  n <- paste0(output, "/", n, ".tif")
  
  writeRaster(r, n, overwrite = TRUE)
  
}
stopImplicitCluster()

Parallelisation of processes

After the process, we should have the rasters in the output folder.